The null hypothesis is that models A and B each win with probability 1/2 on every example where their outcomes differ; ties carry no information and are excluded. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one we actually observed. Hover over an entry to display the information used to compute its p-value.
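Concretely, for a single pair this amounts to a two-sided sign test on the examples where the two models disagree. Here is a minimal sketch; the boolean per-example arrays `passed_a` and `passed_b` are hypothetical placeholders for the benchmark records:

```python
import numpy as np
from scipy.stats import binomtest

def sign_test_p_value(passed_a: np.ndarray, passed_b: np.ndarray) -> float:
    """Two-sided sign test: under the null, A beats B with probability 1/2
    on every example where exactly one of the two models succeeds."""
    # Keep only examples where the models differ; ties carry no signal.
    wins_a = int(np.sum(passed_a & ~passed_b))
    n_decisive = int(np.sum(passed_a != passed_b))
    if n_decisive == 0:
        return 1.0  # identical records: no evidence either way
    return binomtest(wins_a, n_decisive, p=0.5, alternative="two-sided").pvalue
```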
We can also look at the typical p-value obtained for a typical difference in accuracy. Hover over a point to display the actual model pair it corresponds to.
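One way to generate those points is to run the test over every model pair, reusing `sign_test_p_value` from the sketch above; the `results` mapping from model name to per-example pass vector is again a hypothetical layout of the data:

```python
from itertools import combinations
import numpy as np

def pairwise_pvalues(results: dict[str, np.ndarray]) -> list[tuple[float, float]]:
    """For every model pair, return (|accuracy difference|, p-value),
    ready to be plotted against each other."""
    points = []
    for a, b in combinations(results, 2):
        diff = abs(results[a].mean() - results[b].mean())
        points.append((diff, sign_test_p_value(results[a], results[b])))
    return points
```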
Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
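A sketch of the tally for one pair, under the assumption that the two tie categories are "both correct" and "both incorrect":

```python
import numpy as np

def head_to_head(passed_a: np.ndarray, passed_b: np.ndarray) -> dict[str, int]:
    """Tally wins and ties for one model pair; the two tie buckets
    (both correct / both incorrect) are our assumed reading of the table."""
    return {
        "win_a": int(np.sum(passed_a & ~passed_b)),
        "win_b": int(np.sum(~passed_a & passed_b)),
        "tie_both_correct": int(np.sum(passed_a & passed_b)),
        "tie_both_incorrect": int(np.sum(~passed_a & ~passed_b)),
    }
```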
We show three metrics currently used for evaluating code models: raw accuracy (pass@1) as reported by benchmarks, average win rate against all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena; see the sketch after the table). The three usually correlate near-perfectly.
| # | model | pass@1 | win rate | Elo |
|---|---|---|---|---|
| 0 | gpt-4-1106-preview | 0.733 | 0.814 | 1285.135 |
| 1 | meta-llama-3-70b-instruct | 0.690 | 0.775 | 1244.763 |
| 2 | opencodeinterpreter-ds-33b | 0.685 | 0.756 | 1225.049 |
| 3 | white-rabbit-neo-33b-v1 | 0.669 | 0.722 | 1191.603 |
| 4 | opencodeinterpreter-ds-6.7b | 0.664 | 0.721 | 1193.917 |
| 5 | deepseek-coder-6.7b-instruct | 0.656 | 0.696 | 1167.416 |
| 6 | xwincoder-34b | 0.648 | 0.683 | 1156.001 |
| 7 | bigcode--starcoder2-15b-instruct-v0.1 | 0.651 | 0.679 | 1149.482 |
| 8 | HuggingFaceH4--starchat2-15b-v0.1 | 0.646 | 0.672 | 1145.369 |
| 9 | mixtral-8x22b-instruct-v0.1 | 0.643 | 0.664 | 1138.261 |
| 10 | starcoder2-15b-oci | 0.632 | 0.652 | 1128.275 |
| 11 | CohereForAI--c4ai-command-r-plus | 0.635 | 0.649 | 1125.543 |
| 12 | speechless-starcoder2-15b | 0.624 | 0.635 | 1116.710 |
| 13 | Qwen--Qwen1.5-72B-Chat | 0.616 | 0.618 | 1100.714 |
| 14 | deepseek-coder-6.7b-base | 0.587 | 0.566 | 1056.786 |
| 15 | codegemma-7b-it | 0.569 | 0.534 | 1030.377 |
| 16 | speechless-starcoder2-7b | 0.563 | 0.512 | 1014.060 |
| 17 | databricks--dbrx-instruct | 0.558 | 0.505 | 1004.553 |
| 18 | microsoft--Phi-3-mini-4k-instruct | 0.542 | 0.479 | 982.766 |
| 19 | codegemma-7b | 0.524 | 0.458 | 966.646 |
| 20 | octocoder | 0.497 | 0.413 | 929.118 |
| 21 | mixtral-8x7b-instruct | 0.497 | 0.411 | 926.871 |
| 22 | codegemma-2b | 0.466 | 0.371 | 893.445 |
| 23 | open-hermes-2.5-code-290k-13b | 0.458 | 0.356 | 880.735 |
| 24 | gemma-1.1-7b-it | 0.450 | 0.340 | 865.755 |
| 25 | starcoder2-3b | 0.439 | 0.323 | 851.507 |
| 26 | gemma-7b | 0.434 | 0.319 | 848.045 |
| 27 | codegen-6b | 0.429 | 0.309 | 837.629 |
| 28 | mistral-7b | 0.421 | 0.292 | 820.887 |
| 29 | codet5p-2b | 0.381 | 0.244 | 774.521 |
| 30 | mistralai--Mistral-7B-Instruct-v0.2 | 0.370 | 0.236 | 766.272 |
| 31 | codegen-2b | 0.360 | 0.211 | 741.190 |
| 32 | gemma-7b-it | 0.328 | 0.193 | 718.705 |
| 33 | gemma-2b | 0.341 | 0.193 | 721.893 |
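For reference, Chatbot Arena obtains its Elo-like ratings by fitting Bradley-Terry coefficients with logistic regression and mapping them onto the familiar scale (base 10, scale 400, centered at 1000, which matches the range in the table above). A minimal sketch along those lines; the `battles` list of `(model_a, model_b, winner)` tuples is a hypothetical input format with ties already dropped:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_elo(battles, models, scale=400, base=10, init=1000):
    """Fit Bradley-Terry coefficients by logistic regression and map them
    onto an Elo-like scale. `battles` holds (model_a, model_b, winner)
    tuples with winner in {"a", "b"}; decisive outcomes only."""
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for r, (a, b, winner) in enumerate(battles):
        # +log(base) for the first model, -log(base) for the second, so the
        # fitted coefficients live directly on the Bradley-Terry scale.
        X[r, idx[a]] = +np.log(base)
        X[r, idx[b]] = -np.log(base)
        y[r] = 1.0 if winner == "a" else 0.0
    lr = LogisticRegression(fit_intercept=False, penalty=None)  # sklearn >= 1.2
    lr.fit(X, y)
    return {m: scale * lr.coef_[0][idx[m]] + init for m in models}
```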